Contact:(+44) 07967741721, stephen.newhouse@kcl.ac.uk, stephen.j.newhouse@gmail.com


Important: This document is best viewed as an HTML file in any web browser


Overview

ngs_pipeline


Basic analysis of NGS data for clinical genetics involves the following steps:

  • quality control
  • alignment (read mapping)
  • variant calling
  • variant annotation
  • variant prioritisation
  • reporting

It is essential to assess data quality and performance at each step.

At the end of this workshop you will be able to:

  1. Use the Galaxy suite of bioinformatic tools
  2. Assess the quality of raw NGS data in fastq format prior to alignment
  3. Align NGS data to the reference human genome using BWA
  4. Describe the contents of Fastq, SAM and BAM files
  5. Make an assessment of the alignment process
  6. Conduct quality control filtering of reads
  7. Visualise aligned data
  8. Call variants and conduct quality control filtering and annotation of variants
  9. Prioritise variants for follow-up/clinical reporting

Note

Work through this document at your own pace.
Please read everything carefully. Ask questions and have fun!

At the beginning of some of the sections I list the main software called by Galaxy and their basic input and output formats.


WARNING!

Galaxy is an ongoing project, and bioinformatics software is still being developed, fixed, tweaked and improved upon. As a result, some of the screenshots in this document may not exactly match what you see on the live usegalaxy.org website or the Galaxy instance we provide on the Genomics England compute environment. Software versions may also differ. Don’t panic. Keep calm. The workflow/pipeline and tools we use will stay the same - there may just be some subtle differences. Use the search bar in Galaxy to find the tool you need. Ask questions. Google it. Report any bugs and typos.



The NGS Workshop Data

The sequence data that you will be analysing is from a 3-year-old male who presented with difficulty walking after the first year of life, followed by signs of muscle wasting and weakness, muscle rigidity, developmental delays, progressive loss of vision leading to blindness, convulsions, impaired swallowing, paralysis, and dementia. An MRI scan showed a “tigroid” or “leopard-skin” appearance within the deep white matter - recognizable abnormalities of white matter. Further clinical tests suggest a diagnosis of Metachromatic leukodystrophy (MLD, also called Arylsulfatase A deficiency), a lysosomal storage disease commonly listed in the family of leukodystrophies as well as among the sphingolipidoses, as it affects the metabolism of sphingolipids.

The patient’s exome was sequenced using the paired-end method on an Illumina HiSeq 2500 following target enrichment using the Nextera Rapid Capture Exome and Expanded Exome kit.

This data is simulated - but it does represent a real-life scenario

Illumina Sequencing

Publications:


Getting the Data

  1. Create a local folder on your computer called workshop_data_2018 (or something memorable)
  2. Download the test data from the links provided below


The Workshop Data

FILE DESCRIPTION URL
WES01_chr22m_R1.fastq.gz Illumina fastq file read 1 click me
WES01_chr22m_R2.fastq.gz Illumina fastq file read 2 click me
chr22.genes.hg19.bed annotation bed file click me

You should have a folder that looks something like:

workshop_folder



Let’s get started!


1. Galaxy

You will be using Galaxy to construct and run your NGS pipeline. Galaxy is an open source, web-based platform for data intensive biomedical research. If you are new to Galaxy start here and/or consult the help resources:

Learning Resources

  • Galaxy NGS 101: https://wiki.galaxyproject.org/Learn/GalaxyNGS101
    This page is designed to serve as a comprehensive resource describing manipulation and analysis of NGS data. While it is designed to highlight the utility of Galaxy it will also provide information that is broadly applicable and can be used for teaching of big data biology. It is accompanied by a collection of screencasts.

Getting more help on Galaxy


1.1 Sign in to Galaxy

You should have all registered for a Galaxy account!

  1. Go to https://usegalaxy.org/ and sign in

INFO: The Galaxy Interface

  • The available “Tools” are in the left tool pane.
  • The central panel is your working area where you can visualise your data, select tool options and execute jobs.
  • Your “History” is shown in the right hand panel. Here you will see your data, results and a history of all the tools that you have used.



INFO: The Galaxy Tool Interface

A lot of useful information is displayed in the centre panel. Don’t ignore it!

An example is shown below. This is for the software FreeBayes: a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

  • Tool name, Galaxy Version, and tool version selector and other galaxy options
  • Tool options e.g. set input files, reference genome version, tool specific parameters and execute.
  • What it does
  • Description
  • Galaxy-specific options
  • Acknowledgments
  • Citations

You will get used to this as you work through the workshop


1.2 Start a new Galaxy History

  1. Rename the history for this session (Click on name, type new name, press return)
  2. Use the Name: WES01 NGS Workshop .


history_name

1.3 Load the Data

  • Click Get Data > Upload File .



  • Click Chose local file .



  • Navigate to workshop_data_2018 on your local computer and select all 3 files and click open .



  • The files will appear in the Download from web or upload from disk window .



  • You now need to tell Galaxy what type of data these are and which version of the human genome reference we wish to use.

INFO: The NGS Data Files and Formats

Paired End Sequence data : FASTQ Format

One lane of paired end sequencing was performed so you have two files of raw sequence data (WES01_chr22m_R1.fastq.gz and WES01_chr22m_R2.fastq.gz) which contain all the sequence data for read 1 (R1) and read 2 (R2) respectively (Figure: Paired End Sequence data). To save on computing time and disk space, the NGS data for WES01 has been filtered to contain reads mapping to chromosome 22 only and the files have been compressed (hence the .gz extension).

Figure: Paired End Sequence data


The raw NGS data is held in fastq format: http://en.Wikipedia.org/wiki/FASTQ_format. Fastq files are the simplest and most generic way of storing read sequences and qualities.

  • Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Figure: Fastq Data

@HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1
GCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAATATTTTCTTATTAACTTCC
+
CCCFFFFFHHHHHJJJHIJJJJJJJJJJJJJJJJHFGIIIGJJJJIIHHHHHHFFFFFFFEEEEEEEEEDDDDDDDDDDEDDDDFFEEEDDDDEEEDDEDD
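The four-line structure above can be read off directly in a few lines of code. A minimal Python sketch of a FASTQ record parser (illustrative only - real pipelines normally use a library such as Biopython or pysam):

```python
from itertools import islice

def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # one FASTQ record = exactly 4 lines
        if not record:
            return
        header, seq, plus, qual = (line.rstrip("\n") for line in record)
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual)  # Line 4 must match Line 2 in length
        yield header[1:], seq, qual

# The example record above, truncated for brevity
example = [
    "@HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1",
    "GCACGTGGGT",
    "+",
    "CCCFFFFFHH",
]
for name, seq, qual in parse_fastq(example):
    print(name, len(seq))
```

For a gzipped file such as WES01_chr22m_R1.fastq.gz you would wrap this with `gzip.open(path, "rt")`.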

Illumina sequence identifiers

Sequences from the Illumina software use a systematic identifier.

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Value Description
EAS139 the unique instrument name
136 the run id
FC706VJ the flowcell id
2 flowcell lane
2104 tile number within the flowcell lane
15343 ‘x’-coordinate of the cluster within the tile
197393 ‘y’-coordinate of the cluster within the tile
1 the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y Y if the read is filtered, N otherwise
18 0 when none of the control bits are on, otherwise it is an even number
ATCACG index sequence

Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
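The table above maps each colon-separated field to a meaning, so the identifier can be split mechanically. A Python sketch (field names are my own labels, chosen to match the table):

```python
def parse_illumina_id(header):
    """Split a Casava 1.8+ style FASTQ header into its documented fields."""
    name, _, description = header.lstrip("@").partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    return {"instrument": instrument, "run_id": run_id, "flowcell": flowcell,
            "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y)}

fields = parse_illumina_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["instrument"], fields["lane"])  # EAS139 2
```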

Phred Quality Scores

A Phred quality score is a measure of the quality of the identification of the sequence of bases generated by automated DNA sequencing.

A phred score or quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect):

Q = -10 * log10(p)

For example, Q20 corresponds to a 1 in 100 chance that the base call is wrong, and Q30 to a 1 in 1000 chance.
It is more efficient to store the sequence quality scores as characters because a single character requires less disk space than a multi-digit number, and one symbol per base keeps the sequence and quality lines the same length. Sequence quality scores on the phred scale are recovered by mapping each ASCII character back to its associated number.


encoding_scores
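The encoding above is easy to reproduce: for Sanger/Illumina 1.8+ data the ASCII code of the quality character minus 33 gives Q, and Q gives back the error probability. A small sketch:

```python
def phred(ch, offset=33):
    """ASCII quality character -> integer Phred score (Sanger, offset 33)."""
    return ord(ch) - offset

def error_probability(q):
    """Phred score -> probability the base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Decode the first few characters of the example quality string
quals = [phred(c) for c in "CCCFFFFFHH"]
print(quals, error_probability(quals[0]))
```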


BED File Annotation data: chr22.genes.hg19.bed

The uploaded file chr22.genes.hg19.bed describes the exome sequences and genes that have been targeted in your trial data - in .bed format. A BED file (.bed) is a tab-delimited text file that defines genomic features. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat.

Briefly, the BED format consists of one line per feature, each containing 3-12 columns of data, plus optional feature definition lines. A minimal .bed file uses 3-4 tab delimited columns of data to describe the target:

The first three fields in each feature line are required; the fourth is optional:

  1. chrom - name of the chromosome or scaffold. Any valid seq_region_name can be used, and chromosome names can be given with or without the ‘chr’ prefix.
  2. chromStart - start position of the feature in standard chromosomal coordinates (i.e. the first base is 0).
  3. chromEnd - end position of the feature in standard chromosomal coordinates.
  4. name (optional) - label to be displayed under the feature, if turned on in “Configure this page”.
  • a quick look at chr22.genes.hg19.bed
chr22   16256331    16256677    POTEH-chr22-16256332-16256677
chr22   16258184    16258303    POTEH-chr22-16258185-16258303
chr22   16266928    16267095    POTEH-chr22-16266929-16267095
chr22   16269872    16269943    POTEH-chr22-16269873-16269943
chr22   16275206    16275277    POTEH-chr22-16275207-16275277
chr22   16277747    16277885    POTEH-chr22-16277748-16277885
chr22   16279194    16279301    POTEH-chr22-16279195-16279301
chr22   16287253    16287937    POTEH-chr22-16287254-16287937
chr22   16448825    16449804    OR11H1-chr22-16448826-16449804
chr22   17071647    17073700    CCT8L2-chr22-17071648-17073700
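Because BED starts are 0-based and ends are exclusive, the feature length is simply chromEnd - chromStart. A Python sketch that parses lines like those above:

```python
def parse_bed_line(line):
    """Parse one minimal BED feature: chrom, 0-based start, end, optional name."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name

chrom, start, end, name = parse_bed_line(
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677")
print(name, end - start)  # name and feature length in bp
```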


1.4 Set Data Type and Genome Build

  1. Set the data type for each data set. Click Type and select the data format from the drop down menu:
    i. chr22.genes.hg19.bed Type = bed
    ii. WES01_chr22m_R1.fastq.gz Type = fastqsanger
    iii. WES01_chr22m_R2.fastq.gz Type = fastqsanger
  2. Select Genome (set all): from the bottom right of the upload data window
  3. From the drop-down menu, start typing hg19 and select Human Feb. 2009 (GRCh37/hg19)
  4. Click Start. .



  • The data will begin to upload to your session. You will see the green progress bar “Adding to history”. This will take a few minutes (or longer, depending on network speeds and the number of other users logged in world wide!). .



  • When the data has finished uploading, click Close. .


close_icon

  • The uploaded data will appear in the history panel:- .



  • You can view the raw data by clicking on the eye icon .




2. Pre-Alignment QC

Software

Tools

  • fastqc
  • trimmomatic

Input/Output

  • fastqc input: raw fastq files
    • WES01_chr22m_R1.fastq.gz
    • WES01_chr22m_R2.fastq.gz
  • fastqc output: html and text file
  • trimmomatic input:
    • Nextera adapter sequences (provided by galaxy)
    • WES01_chr22m_R1.fastq.gz
    • WES01_chr22m_R2.fastq.gz
  • trimmomatic output: quality trimmed fastq files (4 files)
    • trimmomatic WES01_chr22m_R1.fastq.gz (paired)
    • trimmomatic WES01_chr22m_R2.fastq.gz (paired)
    • trimmomatic WES01_chr22m_R1.fastq.gz (unpaired)
    • trimmomatic WES01_chr22m_R2.fastq.gz (unpaired)

2.1 Pre-Alignment QC: View fastq data files

FASTQ_Format: A FASTQ file normally uses four lines per sequence.

Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

  • To view your raw fastq data click on the eye icon .




Illumina sequence identifiers

Sequences from the Illumina software use a systematic identifier:

@HWUSI-EAS100R:6:73:941:1973#0/1

ID Description
HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 ‘x’-coordinate of the cluster within the tile
1973 ‘y’-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.

With Casava 1.8 the format of the ‘@’ line changed to the colon-separated form described in section 1.3 above, e.g. @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG, where the final field is the index sequence (or, in more recent versions of the Illumina software, the sample number taken from the sample sheet).


  • View the fastq data files and make a note of the following information: .


1. The unique instrument name
2. The run id
3. The flowcell id
4. The flowcell lane

  • This can all be obtained from the sequence identifier i.e. line 1 of the first fastq entry:- .


Fastq Header Explanation

Header: @HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1
            |       |     |      |
mapping:    1       2     3      4
  1. The unique instrument name: HWI-D00119
  2. The run id: 50
  3. The flowcell id: H7AP8ADXX
  4. The flowcell lane: 1
  • We will use this later to add metadata (i.e. read group information) and sample information to our NGS analysis.



2.2 Pre-Alignment QC: Assess the quality of raw sequence data

To assess the quality of the raw sequence data and to guide quality control we will use a program called FastQC. The program outputs summary graphs and tables that show if there are any problem areas, which could influence assembly or variant calling if not addressed.

You can learn more about the program here (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

  1. In Tool Pane: Go to NGS: QC and manipulation > FastQC Read Quality reports
  2. Select multiple datasets, highlight both fastq files and click Execute .



The FastQC reports will be shown in the History Pane. Queued jobs in grey with a clock symbol, running jobs in yellow with a buffering symbol, finished jobs in green, failed jobs in red with a cross. .



  • Click View data button for FastQC on data 3:Webpage to view the report .



Use the FastQC reports on data 1 and 2 to answer:

  1. How many reads do the files contain?
  2. How long are the reads (bp)?
  3. Has either file failed any of the sequence quality checks?

INFO: Per base sequence quality for WES01_chr22m_R1.fastq



Looking at the per base sequence quality you will notice that the average base quality drops towards the end of reads (Figure above). This drop in quality is typical for ensemble-based sequencing by synthesis (SBS) methods, such as Illumina’s, which add complementary bases one at a time in a cluster of identical sequences to determine a consensus sequence from the ‘average’ sequence signal over all copies in the cluster. As nucleotides are added, some of the sequences in a cluster grow at a different rate and become desynchronized, which reduces the accuracy of the ‘average’ sequence signal.

INFO: Read-length and phasing (From Fuller et al. 2009: Nat Biotechnol. doi: 10.1038/nbt)



Another particularly important plot is that of ‘Overrepresented sequences’, which lists sequences that account for more than 0.1% of the total. The presence of an overrepresented sequence suggests that the sequence is biologically significant, or that the library is contaminated or has low diversity. To check for contamination, each overrepresented sequence is compared to a database of common contaminants such as sequencing adaptors which can then be removed from the raw FastQ data.
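The ‘per base sequence quality’ plot is essentially a column-wise summary of the quality strings. A toy Python version of that summary (not FastQC itself - just the idea behind the plot, for reads of equal length):

```python
def per_base_mean_quality(qual_strings, offset=33):
    """Column-wise mean Phred score across reads of equal length."""
    length = len(qual_strings[0])
    totals = [0] * length
    for q in qual_strings:
        for i, ch in enumerate(q):
            totals[i] += ord(ch) - offset  # decode each quality character
    return [t / len(qual_strings) for t in totals]

# Two toy reads whose quality collapses at the 3' end ('#' = Q2)
means = per_base_mean_quality(["IIIIFF##", "IIIIII##"])
print(means)
```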


2.3 Pre-Alignment QC: Filter reads based on quality

In a typical analysis you may want to raise technical issues identified by FastQC such as low read count, poor quality, and overrepresented sequences with the data provider. To ensure that only data of a certain quality is used for further analysis we will exclude low quality reads. In paired-end data there are two fastq files per lane of sequencing which are synchronised so that matching pairs are stored in the same line of each file (e.g. the read in line 1, file 1 is paired with the read in line 1, file 2 and so on).

To maintain this order when filtering, we will filter the fastq files with Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170), which processes both files together so that reads are removed as a pair.

  • In Tool Pane: Go to NGS: QC and manipulation > Trimmomatic
  • Select the two reads .



  • Select YES for Perform initial ILLUMINACLIP step?
  • In the dropdown box select Nextera (paired-end) .


Nextera adapters are:

- Read1  CTGTCTCTTATACACATCTCCGAGCCCACGAGAC  
- Read2  CTGTCTCTTATACACATCTGACGCTGCCGACGA  


Additional Reference: what-sequences-do-i-use-for-adapter-trimming



2.3.1 Trimmomatic Operations

  • Keep the first Trimmomatic Operation: Sliding Window trimming
  • Add an additional Trimmomatic Operation QC step
  • Under Trimmomatic Operation, Click: + Insert Trimmomatic Operation .



  • Select > Drop reads below a specified length (MINLEN) set this to 50 .



You should now have set up 2 Trimmomatic Operations

  1. Sliding Window (number of bases:4, average quality 20)
  2. Drop reads below a specified length (50bp)
  • Click Execute .


For paired-end data a particular strength of Trimmomatic is that it retains the pairing of reads (from R1 and R2) in the filtered output files:

Two FASTQ files (R1-paired and R2-paired) contain one read from each pair where both have survived filtering. Additionally two FASTQ files (R1-unpaired and R2-unpaired) contain reads where one of the pair failed the filtering steps.
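The two operations configured above can be sketched in Python. Assumptions, matching the settings rather than Trimmomatic’s exact implementation: quality scores already decoded to integers, a window of 4 bases, a cut at the first window whose mean quality drops below 20, then a MINLEN drop of reads shorter than 50 bp:

```python
def sliding_window_trim(seq, quals, window=4, min_avg=20):
    """Cut the read at the first window whose mean quality falls below min_avg."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            return seq[:i], quals[:i]
    return seq, quals

def keep_read(seq, minlen=50):
    """MINLEN step: keep only reads at least minlen bases long after trimming."""
    return len(seq) >= minlen

# A toy 60 bp read whose quality collapses near the 3' end
seq = "A" * 60
quals = [30] * 55 + [10] * 5
trimmed, tq = sliding_window_trim(seq, quals)
print(len(trimmed), keep_read(trimmed))
```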

When the job has successfully run you should see something like:



2.4 Rerun FastQC on the filtered reads to see the effect of filtering.

  • In Tool Pane: Go to NGS: QC and manipulation > FastQC Read Quality reports
  • Select > 8: Trimmomatic on WES01_chr22m_R1.fastq (R1 paired) and 9: Trimmomatic on WES01_chr22m_R2.fastq (R2 paired)
  • Execute .




  • Look at the results for the paired files (e.g 12 FastQC on data 8: Webpage) : .



  1. How many reads do the files contain?
  2. How long are the reads (bp), and is the quality distribution better?

There are other ways to clean up reads, such as trimming or filtering by other criteria, that could be used depending on the data quality and requirements.

3. Alignment

Software

Tools

  • bwa

Input/Output

  • bwa input: Reference Genome and indexes for use with bwa (provided by galaxy)
    • hg19 (fasta file)
    • bwa indexes:
      • .amb is a text file recording occurrences of N (or other non-ATGC) characters in the reference fasta.
      • .ann is a text file recording reference sequence names, lengths, etc.
      • .bwt is binary: the Burrows-Wheeler transformed sequence.
      • .pac is binary: the packed sequence (four bases encoded per byte).
      • .sa is binary: the suffix array index.
  • bwa input: paired fastq files
    • trimmomatic WES01_chr22m_R1.fastq.gz (paired)
    • trimmomatic WES01_chr22m_R2.fastq.gz (paired)
  • bwa output: SAM/BAM file
    • BAM file
    • BAI (BAM Index: automatically produced by galaxy)

3.1 Alignment: Align sequence data to the reference human genome

We will use a program called BWA-MEM (Burrows-Wheeler Alignment) to align the sequence data to the human genome. You can learn more about the bwa program here (http://bio-bwa.sourceforge.net/).

3.2 BWA MEM basic settings:

  1. In Tool Pane: Go to NGS:Mapping > BWA-MEM
  2. Select the following options:
    1. Use built in genome index > Human (Homo sapiens) (b37): hg19
    2. Single or Paired-end reads > Paired
    3. Select first set of reads > 8: Trimmomatic on WES01_chr22m_R1.fastq (R1 paired)
    4. Select second set of reads > 9: Trimmomatic on WES01_chr22m_R2.fastq (R2 paired)
    5. Enter mean, standard deviation.. > 250,50
      .



3.3 Set read groups (SAM/BAM specification):

This next step is very important!

  • Select Set read groups information? > Set read groups (SAM/BAM specification):
  • Enter the following for the options:
    • Read group identifier (ID): HWI-D0011.50.H7AP8ADXX.1.WES01
    • Read group sample name (SM): WES01
    • Read group platform (PL): ILLUMINA
    • Read group library name (LB): nextera-wes01-blood
    • Read group platform unit (PU): HWI-D00119
    • Read group date (DT): 2017-02-23

IMPORTANT:READ: https://galaxyproject.org/learn/galaxy-ngs101/#read-groups for more information on read groups. .
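For reference, the fields you enter end up in the BAM header as a single @RG line (tab-separated TAG:value pairs). A Python sketch of assembling such a line from the values above - illustrative only, since Galaxy/BWA-MEM builds it for you:

```python
def read_group_line(**tags):
    """Assemble a SAM @RG header line from TAG=value pairs (ID is required)."""
    assert "ID" in tags
    return "@RG\t" + "\t".join(f"{k}:{v}" for k, v in tags.items())

rg = read_group_line(ID="HWI-D0011.50.H7AP8ADXX.1.WES01", SM="WES01",
                     PL="ILLUMINA", LB="nextera-wes01-blood", PU="HWI-D00119")
print(rg)
```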



  • Now click execute

INFO: The Binary Alignment/Map file (BAM)

The alignment process maps the read data to the reference human genome, creating a Binary Alignment/Map (BAM) file. The BAM file is not directly viewable; clicking the view button will download the file.

INFO: Understanding and manipulating SAM/BAM datasets

IMPORTANT:READ: https://galaxyproject.org/learn/galaxy-ngs101/#understanding-and-manipulating-sambam-datasets for more information.


4. Post Alignment QC and Filtering

Software

Tools

  • samtools tool_option
  • picard tools tool_option
  • bedtools tool_option

Input/Output

  • picard tools MarkDuplicates input: bwa aligned BAM file
  • picard tools MarkDuplicates output: duplicate read marked BAM file

  • samtools view input: duplicate read marked BAM file
  • samtools view output: Filtered duplicate read marked BAM file and related bai index

  • samtools Flagstat input: Filtered duplicate read marked BAM file
  • samtools Flagstat output: text file

  • samtools IdxStats input: Filtered duplicate read marked BAM file
  • samtools IdxStats output: text file

  • picard tools CollectInsertSizeMetrics input: Filtered duplicate read marked BAM file
  • picard tools CollectInsertSizeMetrics output: text and pdf (a plot) file

  • bedtools coverageBed input:
    • Filtered duplicate read marked BAM file
    • Target intervals file .bed file
      • chr22.genes.hg19.bed
  • bedtools output: text file

4.1 Post Alignment QC and Filtering

For variant calling, reads with low mapping quality (phred <20), unmapped reads, secondary alignments, reads failing platform/vendor quality checks, and duplicate reads that have the same start and stop position are typically removed or ignored because they can influence genotyping accuracy. For example, at heterozygous sites the two alleles should be evenly distributed (50% of reads have the A allele and 50% have the B allele) but if reads with the A allele are duplicated the A allele will become overrepresented and the site might be misinterpreted as homozygous for the A allele.

We will use 1) NGS: Picard > MarkDuplicates and then 2) NGS: SAMtools > Filter SAM or BAM, output SAM or BAM files to make a new BAM file that flags duplicate reads and excludes poor quality reads.

4.2 MarkDuplicates

INFO: How PCR duplicates arise in next-generation sequencing. Duplicates originate mostly from DNA prep methods and cause biases that skew variant calling results.

  1. In Tool Pane: Go to NGS: Picard > MarkDuplicates
  2. Select > 16 Map with BWA-MEM on data 9 and data 8 (mapped reads in BAM format)
  3. Leave everything else the same
  4. Execute



This tool examines aligned records in the supplied SAM or BAM dataset to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged. Two files are produced, 1) the new BAM file with duplicate reads marked and 2) a metrics file summarising the number of duplicate reads found.



4.3 Filter BAM based on mapping quality and bitwise flags

  1. In Tool Pane: Go to NGS: SAMtools > Filter SAM or BAM, output SAM or BAM files
  2. Select > 18 MarkDuplicates on data 16: MarkDuplicates BAM output
  3. Set Minimum MAPQ quality score : 20
  4. Filter on bitwise flag: yes. Skip alignments with any of these flag bits set:
    i. The read is unmapped
    ii. The alignment of this read is not primary
    iii. The read fails platform/vendor quality checks
    iv. The read is a PCR or optical duplicate
  5. Leave everything else the same
  6. Execute



This tool uses the samtools view command in SAMtools toolkit to filter a SAM or BAM file on the MAPQ (mapping quality), FLAG bits, Read Group, Library, or region.

Input is either a SAM or BAM file.

The output file will be SAM or BAM (depending on the chosen option), filtered by the selected options.

4.4 Alignment Statistics: Flagstats

When aligning reads to the reference genome, anywhere between 0 and 20% of reads will not be aligned due to sequencing errors, sample contamination (e.g. bacterial or viral DNA), gaps in the reference genome and genome variation. Use SAMtools Flagstat to determine how many reads have been aligned. We will compare the duplicate-marked BAM file with the final filtered BAM file.

  1. In Tool Pane: Go to NGS SAMtools > Flagstat tabulate descriptive stats for BAM dataset
  2. Select: 18: MarkDuplicates on data 16: MarkDuplicates BAM output first
  3. Execute
  4. Repeat for data set 19 Filter SAM or BAM, output SAM or BAM on data 18: bam, leaving everything else the same
  5. Execute




Click the view data icon for both and use the Flagstat output to determine the percentage of mapped reads, and calculate the number of reads that were filtered out (difference in total read count between the raw and filtered BAM files).


Before Filtering

After Filtering


4.5 Alignment Statistics: Viewing the BAM File

Converting BAM to SAM: As mentioned above, the BAM file is held in binary format and so cannot be viewed directly. However, the BAM file can be converted to a viewable text format known as a Sequence Alignment/Map file or SAM file. The SAM file format is described in detail here (http://samtools.github.io/hts-specs/SAMv1.pdf).

Use BAM to SAM to convert the filtered BAM file to a SAM file.

  1. In Tool Pane: Go to NGS: SAMtools > BAM-to-SAM convert BAM to SAM
  2. Select the filtered bam 19 Filter SAM or BAM, output SAM or BAM on data 18: bam
  3. Execute
  4. Click view data to view the SAM file.



Spend some time looking at this file and its format.

  • Optional. Repeat this step, but this time exclude the SAM header (Header options > Exclude header). This will make it easier to view the aligned reads.


SAM format, basic column information: In the SAM file, lines starting with “@” are headers, and those without “@” are read sequences aligned to the reference.

SAM_Column Description
QNAME Query name of the read or the read pair
FLAG Bitwise flag (pairing, strand, mate strand, duplicate status, etc.)
RNAME Reference sequence name
POS 1-Based leftmost position of clipped alignment
MAPQ Mapping quality (Phred-scaled)
CIGAR Extended CIGAR string (operations MIDNSHP)
MRNM Mate reference name (‘=’ if same as RNAME)
MPOS 1-based leftmost mate position
ISIZE Inferred insert size
SEQ Query sequence on the same strand as the reference
QUAL Query quality (ASCII-33=Phred base quality)


I have extracted two reads for you to look at. Note that they both have the same read id: HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573

HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573   99  chr22   16123342    60  101M    =   16123407    166 CTACCTCCAGAGCATCAGGCCTTGCCTGGCCCTGGGCAGAGAGCTGGTCTACAACCGGCGGCCTGGGGACGAGGGCACTGTCATGTCTCTAGCTGGGAAAT   CCCFFFFFHHHHHIJIJJJIJJJIJIIJIIJJIJIIJJIJJJJIIIJHJJJJJIGGHIGFDDDDDDDDDDDD@BBDDDDDDDDDDCDEDDEDDDDDDDCDD   XA:Z:chr14,-19719869,101M,2;chr14,+19855438,101M,2; MD:Z:101    PG:Z:MarkDuplicates RG:Z:HWI-D0011.50.H7AP8ADXX.1.WES01 NM:i:0  AS:i:101    XS:i:91

HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573   147 chr22   16123407    60  101M    =   16123342    -166    GGGACGAGGGCACTGTCATGTCTCTAGCTGGGAAATACACATTGAACAACTGGTTGGCAACGTTAACGTTGAGCCAGGCGGGCATGCACGCAACATACTAC   B@DDDDDDDDDDDDDDDDDDDDDDEEDEEFFEFFFEEHHHHFHGJJJJJJIIIIJJIIJJIJJJIGGJHEHEGHHFIIJJIJJIJIJJHHHHHFFFFF@B@   XA:Z:chr14,+19719804,101M,2;chr14,-19855503,101M,3;chr2,+132481028,101M,3;  MD:Z:101    PG:Z:MarkDuplicates RG:Z:HWI-D0011.50.H7AP8ADXX.1.WES01 NM:i:0  AS:i:101    XS:i:91


SAM/BAM Format mapping:

QNAME: HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573
FLAG: 147
RNAME: chr22
POS: 16123407
MAPQ: 60
CIGAR: 101M
MRNM: =
MPOS: 16123342
ISIZE: -166
SEQ: GGGACGAGGGCACTGTCATGTCTCTAGCTG…
QUAL: B@DDDDDDDDDDDDDDDDDDDDDDEEDEEF



SAM/BAM Flags Explained

Use this website (http://broadinstitute.github.io/picard/explain-flags.html) to decode the flags (99 and 147) of the reads and determine which strand of the reference genome they are aligned with.
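You can also decode the flags yourself: each flag is a sum of bit values defined in the SAM specification, and bit 0x10 set means the read aligned to the reverse strand. A Python sketch covering the low-order bits relevant here:

```python
# Low-order SAM FLAG bits (from the SAM specification)
FLAG_BITS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "unmapped", 0x8: "mate unmapped",
    0x10: "reverse strand", 0x20: "mate reverse strand",
    0x40: "first in pair", 0x80: "second in pair",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(99))   # forward-strand read, first in pair
print(decode_flag(147))  # reverse-strand read, second in pair
```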


4.6 Alignment Statistics: Generate alignment statistics per chromosome

When viewing the SAM file you may have noticed that not all the reads map to chromosome 22, which is surprising as the data was initially filtered to contain reads mapping to chromosome 22 only. Use IdxStats to generate a breakdown of the number of reads that map to each chromosome or contig.

  1. In Tool Pane: Go to NGS: SAMtools > IdxStats
  2. Select the filtered BAM file (or the newly created SAM file) and click execute
  3. Click view data.


This will produce a file with 4 columns: 1) reference sequence identifier; 2) reference sequence length; 3) number of mapped reads; 4) number of placed but unmapped reads (typically unmapped partners of mapped reads).

Use the IdxStats output to calculate the percentage of reads mapping to chromosome 22.
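The percentage calculation is a straightforward roll-up of column 3. A Python sketch, assuming the 4-column table described above (the read counts here are made up for illustration; the chr22 length is the hg19 value):

```python
def pct_on_target(idxstats_rows, target="chr22"):
    """Fraction of mapped reads on one reference sequence, as a percentage."""
    mapped = {chrom: int(n) for chrom, _length, n, _unmapped in idxstats_rows}
    total = sum(mapped.values())
    return 100 * mapped[target] / total

rows = [("chr22", "51304566", "98000", "120"),   # illustrative counts only
        ("chr14", "107349540", "1500", "10"),
        ("*", "0", "0", "500")]
print(round(pct_on_target(rows), 2))
```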





4.7 Alignment Statistics: Determine the distribution of insert sizes

Insert sizes (the region between the 5’ ends of the paired reads, see Figure below) are important for correct alignment and variant calling. During alignment with BWA-MEM we estimated that the mean insert size was 250bp, so the program expected reads to be separated by this distance plus or minus the standard deviation. For variant calling, reads are assumed to be independent. However, if an insert is smaller than the sum of the read lengths the reads will overlap and not be independent in the overlapping region. If a PCR error occurs in this overlapping region it will be present in both reads, which may result in there being enough evidence to call a variant at this site that is not real.

Overlapping reads duplicating a PCR error shown in yellow
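Whether a pair overlaps is a simple arithmetic check: with 101 bp reads, any insert (5’ end to 5’ end) shorter than 202 bp means the mates share bases. A sketch - the read length is taken from this dataset, the rule itself is general:

```python
def mates_overlap(insert_size, read_len=101):
    """True if the two reads of a pair cover some of the same reference bases."""
    return abs(insert_size) < 2 * read_len  # abs(): ISIZE is negative on one mate

print(mates_overlap(166))  # the example pair from section 4.5: overlapping
print(mates_overlap(250))  # at the expected mean insert size: independent
```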


Use Picard CollectInsertSizeMetrics to produce some statistics and a histogram of the insert size.

  1. In Tool Pane: Go to NGS: Picard > CollectInsertSizeMetrics
  2. Select the filtered BAM file and the Human hg19 reference genome; don’t change the other settings, and click execute.



This tool will produce two output files. The first is a tabular output with some statistics and a list of insert sizes and counts. The second is a PDF with an insert size histogram, which should look like the figure below.

View the tabular output and record the mean insert size and standard deviation



Insert size histogram
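If you want to check the reported values, the mean and standard deviation can be recomputed from the histogram table (insert size, count of read pairs). A sketch with made-up counts:

```python
# Recompute mean and standard deviation of insert sizes from a
# CollectInsertSizeMetrics-style histogram. The histogram values here
# are illustrative, not the workshop output.
import math

histogram = {240: 100, 250: 300, 260: 100}  # size -> number of read pairs

n = sum(histogram.values())
mean = sum(size * count for size, count in histogram.items()) / n
var = sum(count * (size - mean) ** 2 for size, count in histogram.items()) / n
sd = math.sqrt(var)
print(f"mean insert size = {mean:.1f} bp, sd = {sd:.1f} bp")
```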


4.8 Alignment Statistics: Depth of Coverage

Identification and accuracy of variant calling relies on the depth and breadth of sequence coverage. To calculate coverage, a file in bed format describing the target region is required. Bed files use tab delimited columns of data to describe the target: chromosome, left location (bp), right location (bp).

We will use the uploaded BED file “chr22.genes.hg19.bed”, which describes the exome sequence targeted in your trial data.

4.8.1 BEDTools (coverageBed)

  1. In Tool Pane: BEDTools > Compute both the depth and breadth of coverage of features in file A across the features in file B (coverageBed)
  2. Make sure the Version matches the version in the picture below (i.e. it should be 2.22.1)
  3. If it does not match, click the grey Versions tab next to the Options tab and select Version 2.22.1
  4. Select the filtered bam file
  5. Select overlap the intervals in this BED file (target) > chr22.genes.hg19.bed
  6. Report the depth at each position in each B feature > YES
  7. Execute



The options we selected will report the per-base depth of coverage for each feature in chr22.genes.hg19.bed.

The output will consist of a line for each one-based position in each feature, followed by the coverage detected at that position.

The output explained: the 6th (last) column is the depth at position X (column 5) for the genomic feature (exon) A (columns 1-4).

chr22   16256331    16256677    POTEH-chr22-16256332-16256677   1   16
chr22   16256331    16256677    POTEH-chr22-16256332-16256677   10  19
chr22   16256331    16256677    POTEH-chr22-16256332-16256677   100 90

For line 1: the depth at the first base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 16.
For line 2: the depth at the 10th base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 19.
For line 3: the depth at the 100th base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 90.

Note the order in column 5: it is not sorted numerically. This is a quirk of sorting BED files.

4.8.2 Correct Sort order:

  1. In Tool Pane: Go to Filter and Sort > Sort data in ascending or descending order
  2. Select : Count of overlaps in Filter SAM or BAM, output SAM or BAM on data 20: bam on chr22.genes.hg19.bed
  3. In Column selections, Click + Insert Column Selections
  4. Add Sort on Column 2. Leave all the other options as they are.
  5. In Column selections, Click + Insert Column Selections
  6. Add Sort on Column 5. Leave all the other options as they are.
  7. Execute



View the new sorted data. It should look like this:



4.8.3 Mean Coverage

Mean Coverage: We can calculate the mean coverage of the targeted regions by averaging over the values in column 6:

  1. In Tool Pane: Go to Statistics > Summary Statistics
  2. Select the sorted data
  3. Set Column or expression > c6 (the depth column)
  4. Execute



View the data. It should look like this:



What is the average depth of coverage?
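The sort and averaging steps above can be sketched in a few lines of Python, using the three example coverageBed rows shown earlier (so the mean here is illustrative only, not the answer for the full dataset):

```python
# Sort coverageBed per-base output numerically and compute mean depth:
# the manual equivalent of the Sort and Summary Statistics steps.
rows = [
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t1\t16",
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t100\t90",
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t10\t19",
]

parsed = [line.split("\t") for line in rows]
# Sort on start position (column 2) then base offset (column 5), numerically.
parsed.sort(key=lambda f: (int(f[1]), int(f[4])))

depths = [int(f[5]) for f in parsed]
mean_depth = sum(depths) / len(depths)
print(f"mean depth = {mean_depth:.2f}x")
```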


5. Visualise alignments: Integrative Genomics Viewer (IGV)

Software

Tools

  • IGV

Input/Output

  • IGV input:
    • BAM and BAI file
    • reference genome (IGV provided)
    • optional additional files
      • vcf from variant calling
      • bed files (gene annotations)
  • IGV output:
    • Interactive visual display of data



5.1 Visualise alignments using the Integrative Genomics Viewer (IGV)


It is useful to look at the aligned data, as this is one way to identify variants, assess their quality, and determine their genomic context with respect to other annotated features of the human genome such as genes, repeat sequences, transcription factor binding sites etc. We therefore recommend alignment visualisation before validation of in silico variant calls by independent sequencing methods. The Integrative Genomics Viewer (IGV) is a popular tool for interrogating aligned NGS data (look here for more details on IGV: www.ncbi.nlm.nih.gov/pubmed/22517427).

You can run IGV from Galaxy or download and install your own local version. If this does not work on the University network computers, don’t worry; try it at home on your own computer. Your Galaxy session will remain as long as you don’t delete it.

Visualizing multiple datasets in Integrated Genome Viewer (IGV) - https://wiki.galaxyproject.org/Learn/GalaxyNGS101#Visualizing_multiple_datasets_in_Integrated_Genome_Viewer_.28IGV.29

Galaxy has a number of display applications allowing visualization of various datasets. IGV (Integrative Genomics Viewer) is one of the most versatile applications for looking at positional genomic data. In Galaxy you can view Interval, BED, BAM, and VCF datasets in IGV. The screencast Video NGS101-12 = Displaying multiple tracks in IGV : https://vimeo.com/123414437 shows how to do this.

5.2 Viewing the BAM File

  1. Navigate to your filtered BAM dataset and open the panel
  2. Click on the Human hg19 tab
  3. This will ask you to download a file called igv.jnlp. Save this in your workshop data folder.
  4. Double click igv.jnlp. This will launch IGV.
  5. When IGV opens, make sure the reference genome is set to hg19. You will not see any aligned reads in the central pane yet, because you are looking at the whole genome, which is too zoomed out, and you only have data for the exonic regions of chromosome 22.
  6. Load public annotation tracks into IGV. Click ‘File’ tab and select ‘Load from server’ in the drop down menu.
  7. Select Annotations > Phenotype and disease associations > OMIM




Alternatively

  1. Download IGV from http://software.broadinstitute.org/software/igv/download and install it on your computer.
  2. Download the BAM and BAM index files
  3. Click on the filtered BAM dataset
  4. Click the grey icon that looks like a floppy disk
  5. You will be presented with two download options, 1) and 2)
  6. Download both
  7. Open IGV
  8. Click File and select Load from File
  9. Upload the BAM file only


View the Gene SMARCB1

  1. Enter SMARCB1 in the search box and click enter to go to this gene. We can now see data for the whole gene as a coverage profile in the upper track and individual reads below.


SMARCB1 is a tumour suppressor gene that regulates cell cycle, growth and differentiation. An inactivating germline mutation in exon 1 of SMARCB1 has been reported in patients with schwannomatosis (http://omim.org/entry/162091). Schwannomas are mostly benign tumours involving schwann cells that myelinate the axons of nerve cells but can cause problems if the tumour compresses a nerve. There are several cases where people with schwannomatosis have developed hearing loss due to an acoustic neuroma, which is a schwannoma on the vestibular nerve in the brain that is involved in hearing.

Things to think about: What does the coverage look like? Are all exons equally captured? Zoom in on an exon and get familiar with IGV and viewing the BAM file.



A Variant in SMARCB1

  1. Enter the location ‘chr22:24145525’ in the search box to focus on the data for a particular variant. You can now see the coloured bar consists of two colours representing the two alleles and their height corresponds with the allele frequency.
  2. Right click and select sort alignments by base.
  3. Mouse over the coverage track for the SNV at chr22:24145525 and look at the information IGV displays.



6. Variant Calling

Software

Tools

  • freebayes
  • vcflib (vcffilter)

Input/Output

  • freebayes input:
    • Filtered duplicate read marked BAM file
    • Reference Genome hg19 (fasta file, provided by galaxy)
  • freebayes output:
    • vcf file
  • vcflib vcffilter input: vcf file
  • vcflib vcffilter output: filtered vcf file

6.1 Variant Calling

Accurate and consistent variant calling requires statistical modelling and is essential for the clinical implementation of NGS. However, many programs are available for calling variants and their concordance varies. Furthermore, variants have different levels of confidence due to differences in data quality. For variants with intermediate confidence levels, it is difficult to separate true variation from artefacts that arise from many factors such as sequencing error, misalignment and inaccurate base quality scores. As a result, the evidence for variant calls requires scrutiny and caution should be used when interpreting positive and negative findings especially for indels which are more error prone. At the end of this exercise you will be able to:

  1. Use state-of-the-art software (FreeBayes) to call small variants (SNVs and indels)
  2. Describe the contents of variant call files
  3. Generate and interpret basic variant quality control parameters
  4. Use quality control filters to exclude or flag variants with low confidence

6.2 Variant calling with Freebayes

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. See https://github.com/ekg/freebayes for details on FreeBayes.

  1. In Tool Pane: Go to NGS: Variant Analysis > FreeBayes
  2. Select the filtered BAM file
  3. Use reference genome > Human (Homo sapiens): hg19
  4. Leave all other options as they are
  5. Execute


FreeBayes uses short-read alignments (BAM files with Phred+33 encoded quality scores, now standard) for any number of individuals from a population and a reference genome (in FASTA format) to determine the most-likely combination of genotypes for the population at each position in the reference. It reports positions which it finds putatively polymorphic in variant call file (VCF) format. It can also use an input set of variants (VCF) as a source of prior information, and a copy number variant map (BED) to define non-uniform ploidy variation across the samples under analysis.


6.3 View the VCF



  • View the VCF file created by freebayes and familiarise yourself with the variables by reading the comments in the header section
  • This history panel shows: 5,017 variants, 57 comments
  • The comments appear at the top of the VCF file (lines begin with a ‘#’ character) and explain the format of the info and sample columns


FreeBayes emits a standard VCF 4.1 output stream. This format is designed for the probabilistic description of allelic variants within a population of samples, but it is equally suited to describing the probability of variation in a single sample.

Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth.

By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered.
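In the VCF format, QUAL is Phred-scaled (QUAL = -10 × log10 of the probability that there is no variant at the site), so the probability described above can be recovered from it. A small sketch:

```python
# Convert a Phred-scaled VCF QUAL into the probability that the site is
# truly polymorphic, which is the quantity the QUAL filters act on.
def qual_to_prob_variant(qual):
    """Probability that the site is polymorphic, given its QUAL."""
    return 1.0 - 10.0 ** (-qual / 10.0)

for qual in (1, 10, 20, 30):
    print(qual, round(qual_to_prob_variant(qual), 4))
```

So a QUAL of 20 corresponds to roughly a 99% chance that the site is polymorphic, which is why very low QUAL records are routinely filtered out.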




6.4 Filtering the VCF

  1. In Tool Pane: Go to NGS: VCF Manipulation > VCFfilter
  2. Select the VCF file
  3. Specify filtering expression: -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1"
  4. Leave all other options as they are
  5. Execute



See: freebayes in depth: model, filtering, and walkthrough

The Freebayes Information Fields we used for filtering

  • QUAL=probability that there is a polymorphism at the loci described by the record: 1 - P(locus is homozygous given the data).
  • AO=Alternate allele observations, with partial observations recorded fractionally
  • SAF=Number of alternate observations on the forward strand
  • SAR=Number of alternate observations on the reverse strand
  • RPL=Reads Placed Left: number of reads supporting the alternate balanced to the left (5’) of -the alternate allele
  • RPR=Reads Placed Right: number of reads supporting the alternate balanced to the right (3’) of the alternate allele

What calls do we “know” are poor (w.r.t. freebayes VCF annotations)?

  • low-quality calls (QUAL < N)
    • also, maybe QUAL is high but QUAL / AO is low
  • loci with low read depth (DP < N)
  • alleles that are only seen on one strand
    • remove by “SAF > 0 & SAR > 0”
  • alleles that are only observed by reads placed to the left or right
    • remove by “RPL > 0 & RPR > 0”

Suggested freebayes hard filters for diploid humans:

QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1

  • QUAL > 1: removes horrible sites
  • QUAL / AO > 10 : additional contribution of each obs should be 10 log units (~ Q10 per read)
  • SAF > 0 & SAR > 0 : reads on both strands
  • RPR > 1 & RPL > 1 : at least two reads “balanced” to each side of the site
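The vcffilter expression reads as a simple predicate over the QUAL and INFO fields. A Python sketch of the same logic (field values invented for illustration; AO is assumed non-zero, as it is for any record with alternate observations):

```python
# Re-implementation of the vcffilter hard-filter expression as a Python
# predicate on a variant's QUAL and INFO fields (a sketch; vcffilter
# itself parses the VCF directly).
def passes_hard_filter(qual, info):
    return (
        qual > 1
        and qual / info["AO"] > 10   # ~Q10 contribution per supporting read
        and info["SAF"] > 0          # alternate seen on forward strand
        and info["SAR"] > 0          # alternate seen on reverse strand
        and info["RPR"] > 1          # >1 read balanced to the right
        and info["RPL"] > 1          # >1 read balanced to the left
    )

good = {"AO": 20, "SAF": 11, "SAR": 9, "RPR": 8, "RPL": 12}
one_strand = {"AO": 20, "SAF": 20, "SAR": 0, "RPR": 8, "RPL": 12}
print(passes_hard_filter(500, good))        # both strands, well supported
print(passes_hard_filter(500, one_strand))  # strand-biased: filtered out
```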

Look at the Filtered VCF File


How many variants are there after filtering?

before after

Descriptive statistics and quality control on variants


7. Variant Annotation and Prioritisation

Software

Tools

  • (w)ANNOVAR

Input/Output

  • ANNOVAR input:
    • vcf file
    • annotation databases (provided by galaxy)
  • ANNOVAR output:
    • annotated vcf file
    • text file with annotations

7.1 Variant Annotation and Prioritisation

At this point in the analysis of whole exome data from patient WES01, we have assessed the quality of raw sequence data, removed low quality reads, aligned sequence data to the reference genome, removed poorly mapped reads and duplicate reads, assessed the alignment process, called variants and evaluated variant calling. The aim of this practical is to complete the analysis of WES01 by annotating the variants and using a filtering strategy to generate a short-list of potentially pathogenic variants. At the end of this exercise you will be able to:

  1. Use ANNOVAR to annotate variants with respect to genes, databases of normal variation, and predictors of pathogenicity
  2. Apply basic variant filters to generate a list of candidate genes and likely causative variants
  3. Use phenotype information to improve annotation, filtering and prioritisation of likely causative variants

7.2 Annotating variants

The result of variant calling is a variant call file (VCF) which describes the location and data quality of each variant. However, the initial VCF file does not provide any information about the functional relevance of variants and their potential contribution to disease. To gain these insights we will use ANNOVAR (Wang et al 2010) to annotate each variant in the VCF file with respect to their location in relation to genes and coding sequences (exon, intron or intergenic), whether they change the coding sequence and if so how (missense, stop gain, synonymous, frameshift, amino acid consequence etc). In addition, we will cross reference the variants with databases of known variation (1000 genomes, dbSNP, Exome Sequencing Project and COSMIC) and predictions of functional significance (avsift and conservation scores). At this stage, it is important to be aware that the annotation result will vary according to the choice of annotation software (alternative software: SnpEff Cingolani et al 2012 and Variant Effect Predictor VEP McLaren et al 2010) and definition of transcripts (Ensembl Flicek et al 2012, RefSeq Pruitt et al 2012, Gencode Harrow et al 2012). This variability occurs because a single variant may have different effects in different transcripts (isoforms of the gene) and in some cases may affect different genes (genes can overlap, one on the forward strand and the other on the reverse strand), each with multiple transcripts. It is also possible to interpret a single variant as having multiple effects on a single transcript (Figure 1). The annotation software must therefore apply a set of rules to determine which variant effect takes precedence (see here for ANNOVAR precedence rules), but these rules vary between programs so that they generate different results. For example, the overall agreement for Loss of Function (LoF) variants annotated by ANNOVAR and VEP is just 64% (McCarthy et al 2014 : Choice of transcripts and software has a large effect on variant annotation).

Annotation examples from McCarthy et al 2014



The variant is an insertion of an A in a stop codon (TGA) in the penultimate base of the exon in all transcripts except one. It is possible to interpret this variant as a frameshift insertion or a stop-loss but it is actually a synonymous variant because the stop codon remains in the same place.

The choice of which transcript set to use for annotation has an even bigger effect on the annotation results. For example, the overlap between LoF variants is only 44% when the same software is used to annotate against transcripts from Ensembl or RefSeq (McCarthy et al 2014). This is because RefSeq transcripts are based on a collection of non-redundant mRNA models that have strong support and are manually curated. As a result, the RefSeq database is highly accurate but does not contain all possible transcripts or gene models (n=41,501 transcripts from RefSeq release 57 that were used by ANNOVAR). In comparison, Ensembl provides a less curated but more comprehensive list of transcripts (n=115,901 transcripts from Ensembl version 69 that were used by ANNOVAR) based on information from the Consensus Coding Sequence (CCDS, Pruitt et al 2009), Human and Vertebrate Analysis and Annotation (HAVANA), Vertebrate Genome Annotation (Vega, Ashurst et al 2005), ENCODE (The ENCODE Project Consortium 2012) and GENCODE (Searle et al 2010).


7.3 ANNOVAR in Galaxy

ANNOVAR is a rapid, efficient tool to annotate functional consequences of genetic variation from high-throughput sequencing data. wANNOVAR provides easy and intuitive web-based access to the most popular functionalities of the ANNOVAR software

This tool will annotate variants using specified gene annotations, regions, and filtering databases. Input is a VCF dataset, and output is a table of annotations for each variant in the VCF dataset or a VCF dataset with the annotations in INFO fields.

ANNOVAR Website: http://annovar.openbioinformatics.org
Paper: http://nar.oxfordjournals.org/content/38/16/e164

  1. In Tool Pane: Go to NGS: Variant Analysis > ANNOVAR Annotate VCF
  2. Select the filtered VCF file
  3. Select all annotations (this may differ depending on software versions and installed databases)
  4. Select Output Data type as Tabular
  5. Execute



Now view the annotated VCF file by clicking ‘View data’ and familiarise yourself with the contents.




7.4 Filter and Prioritisation of variants

There are many ways to filter the data; one of the most common strategies is to select coding variants (c6 == ‘exonic’) that are either rare or absent from databases of known variation such as dbSNP (c20 == ‘.’).

  1. In Tool Pane: Go to Filter and Sort > Filter
  2. With the following condition: c20 == '.' and c6 == 'exonic'
  3. Execute
  4. View the output
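The Galaxy Filter step is equivalent to a row-wise test on the tabular ANNOVAR output. A Python sketch with shortened, made-up rows (real ANNOVAR tables have many more columns; c6 and c20 follow the column numbering used above):

```python
# Keep variants that are exonic (column 6) and absent from dbSNP
# (column 20 == '.'). Rows are padded, invented illustrations.
rows = [
    ["chr22"] + [""] * 4 + ["exonic"] + [""] * 13 + ["."],       # novel exonic: kept
    ["chr22"] + [""] * 4 + ["exonic"] + [""] * 13 + ["rs123"],   # known SNP: dropped
    ["chr22"] + [""] * 4 + ["intronic"] + [""] * 13 + ["."],     # non-coding: dropped
]

# Columns are 1-based in Galaxy (c6, c20) but 0-based in Python.
kept = [r for r in rows if r[5] == "exonic" and r[19] == "."]
print(f"{len(kept)} of {len(rows)} variants pass the filter")
```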




The Case Study

The sequence data that you will be analysing is from a 3-year-old male who presented with difficulty walking after the first year of life, followed by signs of muscle wasting and weakness, muscle rigidity, developmental delays, progressive loss of vision leading to blindness, convulsions, impaired swallowing, paralysis, and dementia. An MRI scan showed a “tigroid” or “leopard-skin” appearance within the deep white matter - recognizable abnormalities of white matter. Further clinical tests suggest a diagnosis of Metachromatic leukodystrophy (MLD, also called Arylsulfatase A deficiency), a lysosomal storage disease commonly listed in the family of leukodystrophies as well as among the sphingolipidoses, as it affects the metabolism of sphingolipids.

The Filtered Variants


Based on the patient’s phenotype and clinical diagnosis, which of the variants above do you think is the causative mutation? If you didn’t have detailed phenotype information and/or the diagnosis, would you have come to the same conclusion?


7.5 View the ARSA chr22:51065713 stopgain SNV in IGV

  • Download the BAM file and the BAM index
  • Download the Freebayes VCF file
  • Load them into IGV


Filtered BAM Filtered VCF



For the causal variant use the VCF file from Freebayes and look at the Quality score and other fields used for filtering.


7.6 wANNOVAR

The online version of ANNOVAR offers some additional annotations and improved variant filtering and prioritisation using phenotype information.

A full tutorial is available here: wANNOVAR

Download your VCF file and use wANNOVAR to perform annotation.

  1. Click the download icon and save the Freebayes filtered VCF file to your computer.
  2. Upload the VCF file to wANNOVAR
  3. Enter the phenotype: Metachromatic leukodystrophy
  4. Click Submit
  5. Open the wANNOVAR results in Excel and look at the annotation for your variant of interest.
  6. Look at the prioritised genes from Phenolyzer and see where your variant is ranked. In less than a minute, ANNOVAR standard filtering, in combination with phenotype information, was able to prioritise and find the real mutation!


INFO: wANNOVAR Annotations

NOTE: If you’re running out of time to finish the practical you can read the section below in your own time and skip to the last section on how to create a workflow.


SIFT SIFT scores for non-synonymous variants are based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences.

Polyphen-2 Polyphen-2 predicts the impact of an amino acid substitution on the structure and function of a protein. Predictions are based on features that characterise the substitution, including the sequence (does the substitution occur within an annotated site in the protein, e.g. an active or binding site, or a non-globular region such as a trans-membrane domain), phylogenetics (how often does the substitution occur in a family of related proteins) and structural information (could the substitution affect the protein’s 3D structure, e.g. the hydrophobic core, electrostatic interactions, interactions with ligands or other important features). These features are assessed by a probabilistic classifier. Two versions of Polyphen-2 are available, HDIV and HVAR, which used different datasets to train and test the prediction models. HDIV should be used for evaluating rare alleles involved in complex phenotypes, where both disease-causing and mildly deleterious alleles are treated as damaging. The HDIV dataset consists of all damaging alleles in the UniProtKb database that affect molecular function and cause human Mendelian diseases, versus differences between human proteins and their closely related mammalian homologues, which are assumed to be non-damaging.

HVAR should be used for diagnostics of Mendelian diseases which requires distinction between mutations with diagnostic effects from all remaining human variation, including a wealth of mildly deleterious alleles. The HVAR dataset used to train and test Polyphen consisted of all human disease causing mutations from UniProtKb together with common human non-synonymous SNPs (MAF>1%) without annotated involvement in disease which were treated as non-damaging.

LRT score Likelihood ratio test (LRT) for prediction of significantly conserved amino acid positions within the human proteome. Mutations that disrupt highly conserved amino acid residues are classified as ‘deleterious’; the rest as ‘neutral’ or ‘unknown’.

Mutation Taster MutationTaster uses a Bayes classifier to predict the disease potential of an alteration. For this prediction, the frequencies of all single features for known disease mutations/polymorphisms were studied in a large training set composed of >390,000 known disease mutations from HGMD Professional and >6,800,000 harmless SNPs and Indel polymorphisms from the 1000 Genomes Project.

Mutation Assessor Mutation Assessor scores estimate functional impact based on evolutionary conservation of the affected amino acids in protein homologs. The method has been validated on a large set (60k) of disease-associated (OMIM) and polymorphic variants.

FATHMM score Functional Analysis through Hidden Markov Models. Predicts the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with “pathogenicity weights”, representing the overall tolerance of the protein/domain to mutations.

RadialSVM Support vector machine (SVM) based ensemble prediction score, which incorporates 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. Larger value means the SNV is more likely to be damaging. Scores range from -2 to 3 in dbNSFP.

LR score LR scores evaluate the deleteriousness of missense mutations by using logistic regression to integrate information from:

  1. function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO, and MutPred)
  2. conservation scores (GERP++, SiPhy and PhyloP)
  3. ensemble scores (CADD, PON-P, KGGSeq and CONDEL)
  4. maximum minor allele frequency

VEST3 score Variant Effect Scoring Tool (VEST) is a supervised machine learning-based classifier for prioritisation of rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~ 45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement.

CADD raw Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection (differences between 1000 Genomes and the Ensembl Compara inferred human-chimpanzee ancestral genome) with simulated mutations. Raw CADD scores come straight from the model and have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values have relative meaning with larger scores indicating that a variant is more likely to have a damaging effect.

CADD phred
Phred-scaled CADD scores are ranked scores based on the genome-wide distribution of raw CADD scores. For example, variants in the top 10% of raw CADD scores are assigned scaled scores of 10 or more, the top 1% scores of 20 or more, the top 0.1% scores of 30 or more, and so on. The result of this transformation is a Phred-like scaled CADD score.
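This rank-to-Phred transformation is simply -10 × log10 of the rank fraction. A small sketch:

```python
# Phred-scale a rank: a variant in the top fraction r of all raw CADD
# scores gets a scaled score of -10 * log10(r), so top 10% -> 10,
# top 1% -> 20, top 0.1% -> 30.
import math

def phred_scale(rank_fraction):
    return -10.0 * math.log10(rank_fraction)

for r in (0.1, 0.01, 0.001):
    print(r, phred_scale(r))
```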

Genomic Evolutionary Rate Profiling Rejected Substitution scores (GERP RS). GERP identifies constrained elements in multiple alignments by quantifying substitution deficits for SNPs. Scores range from -12.3 to 6.17, with 6.17 being the most conserved. Scores greater than 2 provide a high sensitivity while still strongly enriched for truly constrained sites.

PhyloP46way placental Phylogenetic conservation score based on a multiple alignment of 46 vertebrate genomes (10 primates, 33 placental mammals).

PhyloP100way vertebrate Phylogenetic conservation score based on a multiple alignment of 100 vertebrate genomes.

SiPhy 29way logOdds The estimated stationary distribution of A, C, G and T at the site, computed with the SiPhy algorithm from an alignment of 29 mammalian genomes.


8. Galaxy Workflows


8.1 Create and apply a workflow from your Galaxy history

Please work through and read: Creating Workflows and Advanced Workflow Options

Reproducing an experiment on the same data or different datasets and comparing the results is a key feature of research that requires keeping a meticulous set of instructions. In many cases, multiple replicates are required and the best approach is an automated analysis that is consistent and less error prone. In Galaxy, the history acts as a detailed list of experimental instructions that can be reproduced and automated by creating a ‘Workflow’ or bioinformatic pipeline.

  1. Before making your workflow delete any unwanted steps from the history so that it only includes the steps needed to make the final result.

  2. Go to History panel, click the cog icon and select ‘Extract workflow’….


See Extract Workflow from a History: and follow the instructions there.

Name your workflow : Alignment_Variant_Calling_and_Filtering-v0.1 (or something similar, maybe add your initials)


8.2 Running workflows

Now that you have made a workflow/bioinformatic pipeline you can apply the whole process to a new dataset with just a few clicks. Try your workflow on the original fastq files.

  1. In the history pane click the cog icon and select ‘Create New’ to make a new history.

  2. Before running the workflows, we need to put the raw data files into the new history. You can do this by uploading or copying the data from an existing dataset. We will copy the input files by clicking the cog icon and selecting ‘Copy Datasets’

  3. Select ‘WES01 NGS Workshop’ as the source history and the new ‘Unnamed history’ as the destination.

  4. Check the boxes for the 3 input files that we need: 1) the target BED file chr22.genes.hg19.bed, and the raw fastqs 2) WES01_chr22m_R1.fastq and 3) WES01_chr22m_R2.fastq.

  5. Scroll to the bottom of the page and click ‘Copy History Items’.

  6. Click the workflow tab then click your ‘Alignment_Variant_Calling_and_Filtering’ workflow and select run from the drop down menu.

  7. Now choose the relevant input files and click ‘Run workflow’ to analyse the data.


Copying Data sets to create a new History
  1. Select the cog icon at the top of your history panel
  2. Select Copy Data Sets
  3. Copy the 3 input files that we need (chr22.genes.hg19.bed, WES01_chr22m_R1.fastq, WES01_chr22m_R2.fastq) to a new history



Data Set Selection and Order: running Galaxy workflow


note the order of selected data sets

1 = *.bed file  
2 = *.R1.fastq  
3 = *.R2.fastq  


When the workflow has finished view the output and check that the result matches your original analysis.



9. Sharing and publishing your work

For part of the assessment, you will share your history and workflows with the examiner by submitting them as web links in your NGS report. We will now practice how to make these web links by sharing your current history and workflows.

1. To share your history, click the cog icon and select ‘Share or Publish’ then click ‘Make History Accessible via Link’. This will generate a web link that you can share with others so they can import your history into their Galaxy account.




2. To share your workflow, click the workflow tab, then click your workflow and select ‘Share or Download’ from the drop down menu.
3. Click ‘Make workflow Accessible via Link’ to generate a web link that can be used by others to import your workflow and apply it to their own data.




The End


Based on a workshop produced by Will Tapper, University of Southampton (shared 2016)